Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
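Our data has 4898 rows, one for each of 4898 different white wines, across 13 columns. The first column, X, is only a row index, which leaves 11 physicochemical measurements and the quality rating.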
We will now make univariate plots for all variables to identify trends and outliers.
UNIVARIATE PLOTS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Most white wines in our data have a fixed acidity between 6.3 and 7.3. Very few wines have a fixed acidity greater than 9 or less than 5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Most white wines have a volatile acidity between 0.2 and 0.3. There is a long tail of values going up to 1.1. Plotting the data on a log scale to better understand the long tail:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
On a log scale, the data looks closer to a normal curve.
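The effect of the log scale can also be checked from the summary statistics alone. This is a minimal sketch in base R that uses only the numbers printed in the summary above; the names `va` and `tail_ratio` are illustrative and not part of the original analysis.

```r
# Five-number summary of volatile acidity, copied from the output above
va <- c(min = 0.08, q1 = 0.21, median = 0.26, q3 = 0.32, max = 1.10)

# Length of the upper tail relative to the lower tail, measured from the median
# (hypothetical helper, introduced only for this illustration)
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

ratio_raw <- tail_ratio(va)         # raw scale
ratio_log <- tail_ratio(log10(va))  # log10 scale

print(round(c(raw = ratio_raw, log10 = ratio_log), 2))
```

On the raw scale the upper tail is roughly 4.7 times as long as the lower one, while on the log10 scale the ratio drops to about 1.2, which is consistent with the more symmetric shape of the log-scale plot.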
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid in white wines has a narrower spread, with most wines falling between 0.2 and 0.4. There are also wines with citric acid = 0, which makes me wonder whether citric acid is essential to a wine at all.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar is concentrated at the lower end: there are not many very sweet wines out there. There is a clear peak in the data at 5.2. I wonder whether the sugar content is a property of the grapes or an additive, and whether that can explain this peak.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides indicate the amount of salt in the wine. Chlorides in most wines range from about 0.04 to 0.05, a very small range. There seem to be a few wines that are quite salty. I wonder how that influences wine quality and taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide content has a more even spread, with wines distributed roughly equally on the lower and higher sides of the median value. Since it prevents oxidation of the wine, I wonder what influences its quantity per unit volume of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide is also spread more evenly, and the graph looks very much like a normal curve.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
There are a few outliers with density over 1, but most white wines have a density between 0.99 and 1. Density has very low variation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is spread across a small range, between 2.7 and 3.8, and follows an almost normal distribution. All in all, the wines are acidic.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Since sulphates are additives, their quantity must depend on the manufacturer or the creator of the wine. Unlike other characteristics, sulphates are probably decided by direct human intervention to prevent microbial growth. There are two peaks here, and the spread is also quite large. I wonder to what extent a higher quantity of sulphates affects the quality of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Percent alcohol content is highly variable. There is no clear trend or a single peak. Since alcohol content is very important for taste and experience, I wonder what really causes it to vary across the wines and how exactly it correlates with quality.
Quality rating varies from 3 to a maximum of 9. We have very few lower and higher quality wines. Most wines are rated an average 6. I wonder what makes the wines really good or really bad.
We will convert the data type of quality variable into a factor variable.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
As can be observed, the data has a lot of medium quality wines (5, 6, and 7), and very few low quality or high quality wines.
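The conversion to a factor and the resulting table can be reproduced from the counts alone. This is a minimal sketch, assuming an ordered factor with levels 3 to 9 (the original code is not shown, so the `ordered = TRUE` choice and the variable names are assumptions); the quality vector is rebuilt from the table above rather than read from the data file.

```r
# Counts per quality rating, copied from the table above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild an integer quality vector with the same frequencies
quality <- rep(as.integer(names(counts)), times = counts)

# Convert to an ordered factor so that plots treat each rating as a category
# (assumption: the original analysis may have used an unordered factor)
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)

print(table(quality_factor))

# Share of wines rated 5, 6, or 7
mid_share <- mean(quality %in% 5:7)
print(round(mid_share, 3))
```

With these counts, wines rated 5, 6, or 7 make up a little over 92% of the data, which is why the extreme ratings need to be interpreted with care.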
BIVARIATE ANALYSIS
Now that we have plotted all the variables individually, we get a sense that most variables follow a close-to-normal curve with a few outliers. We will now plot these variables against each other to get a sense of how they all come together to define the quality of a wine.
We will first plot all parameters against each other using the ggpairs function to get a sense of the whole data.
Looking closely at the quality variable, we find that it has no strong correlation with any single variable. Different variables vary across a different range for the different quality levels. We will explore this behavior in the next few graphs.
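A pairs plot is essentially a visual correlation matrix, so the same idea can be sketched numerically. The example below is a minimal illustration in base R, using only the first ten rows of three columns as printed in the `str()` output at the top; `cor()` stands in for the ggpairs call, and the name `wine_head` is hypothetical. Ten rows are far too few to draw conclusions, and quality is left out because it is constant (6) in these rows.

```r
# First ten rows of three columns, copied from the str() output above
wine_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation matrix: the numeric counterpart of a pairs plot
cor_matrix <- cor(wine_head)
print(round(cor_matrix, 2))
```

On the full dataset, the same call on all numeric columns would give the coefficients that ggpairs prints in its upper panels.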
WINE QUALITY AND FIXED ACIDITY
Fixed acidity is spread across a wide range for all wines. For wine quality 9 it tends towards 7 and does not have extreme values like the others. But since we have very little data for the quality 9 wines, we cannot say how fixed.acidity determines wine quality. We can also see a large variation in the lowest quality (3) wines.
WINE QUALITY AND VOLATILE ACIDITY
Volatile acidity, as with fixed acidity, varies across a wider range for ‘average quality’ wines, but for wine quality 9 it has a narrower range. We are beginning to get a sense that higher quality wines probably have less “extreme” values, while all wines have similar median values. We will verify this by plotting boxplots for the other variables.
WINE QUALITY AND CITRIC ACID
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
As with acidity, we see that with citric.acid, the box size of the lower quality wines is larger than that of higher quality wines. Citric acid adds freshness to the wines, as detailed in the text description file. But it does not show any marked increase in higher quality wines. For lower quality wines, it again shows higher variation in values. Drawing a similar graph for other parameters:
WINE QUALITY AND RESIDUAL SUGAR
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
We see that high quality wines are lower on sugar content (lower median value).
Similarly, plotting for other parameters:
WINE QUALITY AND OTHER PARAMETERS
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 76 rows containing non-finite values (stat_boxplot).
Higher quality wines seem to have higher alcohol content. Plotting this further:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The median value of alcohol content is higher for higher quality wines.
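Comparing medians across quality levels amounts to a grouped summary. The following is a minimal sketch with `tapply()`; the data frame `toy` contains made-up values chosen only to illustrate the call, not rows from the wine dataset.

```r
# Made-up toy data (NOT from the wine dataset): three wines per quality level
toy <- data.frame(
  quality = factor(c(4, 4, 4, 6, 6, 6, 8, 8, 8)),
  alcohol = c(9.0, 9.4, 10.1, 9.8, 10.5, 11.0, 11.2, 11.8, 12.5)
)

# Median alcohol content within each quality level
median_alcohol <- tapply(toy$alcohol, toy$quality, median)
print(median_alcohol)

# TRUE when the median rises with every step up in quality
rising <- all(diff(as.numeric(median_alcohol)) > 0)
print(rising)
```

On the real data, the same call with the quality factor and the alcohol column gives the per-quality medians that the boxplot draws as the centre lines.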
Other than alcohol, all the other graphs do point us in the same direction as before, that higher quality wines seem to have no extreme values. We will analyse this further by looking at each quality wine separately and analysing its data across various variables.
We will melt the dataset and calculate standard deviation for each variable with each quality wine. Then we will plot the standard deviation of each variable separately against the quality of wine to see if there is a trend.
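The melt-and-summarise step can be sketched as follows. This is a minimal illustration under stated assumptions: base R's `stack()` and `aggregate()` stand in for the melt function so that the sketch needs no extra packages, and the data frame `toy` holds made-up values rather than rows from the wine dataset.

```r
# Made-up toy data (NOT from the wine dataset): three wines per quality level
toy <- data.frame(
  quality        = c(3, 3, 3, 6, 6, 6, 9, 9, 9),
  residual.sugar = c(1, 10, 19, 4, 6, 8, 5, 6, 7),
  chlorides      = c(0.02, 0.05, 0.11, 0.03, 0.04, 0.05, 0.035, 0.04, 0.045)
)

# Reshape to long format: one row per (wine, variable) pair.
# stack() is used here in place of melt(); it returns columns "values" and "ind".
vars <- c("residual.sugar", "chlorides")
long <- stack(toy[, vars])
names(long) <- c("value", "variable")
long$quality <- rep(toy$quality, times = length(vars))

# Standard deviation of each variable within each quality level
sd_by_quality <- aggregate(value ~ variable + quality, data = long, FUN = sd)
print(sd_by_quality)
```

Each row of `sd_by_quality` is one point in the plots that follow: the quality level on one axis and the standard deviation of one variable on the other.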
## Warning: Removed 20 rows containing missing values (geom_point).
We can observe that for free and total sulphur dioxide, the trend of increasing standard deviation with decreasing quality is very clear. Now we will plot the other variables separately to confirm this trend.
From the graphs above, we can conclude that extreme values of residual.sugar, citric.acid, chlorides, free sulphur dioxide, total sulphur dioxide, volatile acidity, fixed acidity, and pH affect the quality rating of a wine negatively. More consistent values across all of these variables (close to the averages, shown as the “mean” in the table) are what make a wine great in terms of quality ratings.
Now looking at the remaining variables:
Sulphates, density, and alcohol do not show a trend in their standard deviation. Alcohol content, as we have seen before, was higher for good quality wines. Now we will look at the other two variables: sulphates and density.
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
We are unable to see a trend with either sulphates or density. Since sulphate is an additive and not an inherent wine property, maybe it is added in such a way that it improves the overall quality of any wine, and that is why we do not see a trend with changing wine quality rating.
Density is also highly correlated to residual sugar content of the wine. The description also says it is connected to the alcohol content. Since it is a derived variable, it is difficult to see a clear trend for it in the various wines.
Apart from that we are unable to comment on the influence of these variables on wine quality with the dataset that we have.
The distribution of quality ratings for the white wines is roughly normal, indicating that it is rare to find really good or really poor quality wines and that there are a lot of medium quality wines.
We found that consistent values across (most) variables (close to the “mean” or “median”) are what make a wine great in terms of quality ratings. As can be seen, lower quality wines show large variation, with very high or very low values of fixed.acidity, volatile.acidity, citric.acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, and pH. The standard deviation of these variables, plotted across the quality levels, follows a downward sloping curve.
Higher quality wines have a higher median value of % by volume alcohol content. The correlation coefficient for alcohol content and wine quality is also high (0.8), indicating a strong relationship between the two.
A quick look into the data revealed that the median values of most variables remained the same across the various wines. I hence found it challenging that there was no apparent relationship between any single variable and wine quality. But with deeper analysis I succeeded in drawing out a trend where higher quality wines had a low spread away from the median values. Also, after reading the description of the various variables in the dataset, I got a sense that with a very high or very low quantity of any of the variables (acidity, sugar, or sulphur compounds) the taste of the wine gets spoilt, much like any other food where a lot of salt or sugar or spice would make it unpalatable. I then continued my analysis to establish this by melting the data and checking the spread of the variables across the different quality levels, and was very glad when the data insights corroborated my assumptions.
I also found it challenging that the data is very limited for some wines. This makes it hard to draw insights conclusively. For example, there was data for only 5 wines of the highest quality.
Another challenge was on the formatting front. The font sizes, graph alignments, and text spacing were new things for me to learn, and I spent a significant amount of time searching the internet for how best to format the document.
I think I succeeded in adequately explaining the relationships between various factors and wine quality.
Overall, as I went about exploring the dataset, I learnt a lot about the factors influencing the quality of wine. I was surprised to see that alcohol content is higher for higher quality wines (and not just the taste). I am surely going to include that in my purchase decision next time I buy wine.